Through this project I am trying to find the top 5 most expensive cities to stay in around the world based on various factors like education, food, housing, transportation, income, entertainment and clothing costs. I am also trying to find a correlation between these factors and analyze the role each factor plays in deciding the cost of living in a city and find the factors the cost of living is highly correlated to. I would also be developing a prediction model that would predict the cost of living of a particular city given the values of each of the predictor variables. The complete (full model) and reduced model would be compared to find which of the two models is more accurate.
I decided to work on this data set as I found it very interesting and something that would be help me get globally aware and analyze all factors that go into deciding the cost of living in a particular city.
The data set is taken from Kaggle.com (https://www.kaggle.com/stephenofarrell/cost-of-living) The data set is collected from the popular website numbeo, it gives a comparison of the cost of living in various cities, The currency is in Euro.The data set consists of 160 cities as columns and 55 items as rows (eg cost of Coke/Pepsi (0.33 liter bottle), cost of One-way Ticket (Local Transport), Price per Square Meter to Buy Apartment in City Centre etc) on which comparison is drawn.
Here each of the 55 items are the rows and there are 160 cities as columns, as there are a lot of rows and columns I have just displayed the first 15 items and first 15 cities to get an idea of how the data set looks like
## The first 15 rows and first 15 columns of the original dataset are:
## X
## 1 Meal, Inexpensive Restaurant
## 2 Meal for 2 People, Mid-range Restaurant, Three-course
## 3 McMeal at McDonalds (or Equivalent Combo Meal)
## 4 Domestic Beer (0.5 liter draught)
## 5 Imported Beer (0.33 liter bottle)
## 6 Coke/Pepsi (0.33 liter bottle)
## 7 Water (0.33 liter bottle)
## 8 Milk (regular), (1 liter)
## 9 Loaf of Fresh White Bread (500g)
## 10 Eggs (regular) (12)
## 11 Local Cheese (1kg)
## 12 Water (1.5 liter bottle)
## 13 Bottle of Wine (Mid-Range)
## 14 Domestic Beer (0.5 liter bottle)
## 15 Imported Beer (0.33 liter bottle)
## Saint.Petersburg..Russia Istanbul..Turkey Izmir..Turkey Helsinki..Finland
## 1 7.34 4.58 3.06 12.00
## 2 29.35 15.28 12.22 65.00
## 3 4.40 3.82 3.06 8.00
## 4 2.20 3.06 2.29 6.50
## 5 2.20 3.06 2.75 6.75
## 6 0.76 0.64 0.61 2.66
## 7 0.53 0.24 0.22 1.89
## 8 0.98 0.71 0.65 0.96
## 9 0.71 0.36 0.38 2.27
## 10 1.18 1.62 1.51 2.02
## 11 7.60 5.32 4.97 6.87
## 12 0.63 0.33 0.29 1.54
## 13 5.87 7.64 6.11 12.00
## 14 0.88 1.79 1.63 2.23
## 15 1.89 2.48 2.09 2.95
## Chisinau..Moldova Milan..Italy Cairo..Egypt
## 1 4.67 15.00 3.38
## 2 20.74 60.00 17.48
## 3 4.15 8.00 4.51
## 4 1.04 5.00 1.69
## 5 1.43 5.00 2.82
## 6 0.64 2.49 0.29
## 7 0.44 1.12 0.18
## 8 0.68 1.21 0.84
## 9 0.33 1.94 0.51
## 10 1.11 2.87 1.04
## 11 5.79 13.54 3.16
## 12 0.59 0.40 0.29
## 13 3.61 7.00 10.72
## 14 0.77 1.20 1.40
## 15 1.38 2.01 2.06
## Banja.Luka..Bosnia.And.Herzegovina Baku..Azerbaijan Guadalajara..Mexico
## 1 3.58 5.27 5.25
## 2 22.99 23.73 23.86
## 3 3.58 4.22 4.25
## 4 1.02 0.84 1.43
## 5 1.53 2.11 2.39
## 6 1.18 0.37 0.66
## 7 0.74 0.21 0.51
## 8 0.69 0.83 0.87
## 9 0.63 0.26 1.35
## 10 1.47 1.03 1.25
## 11 4.76 3.54 4.09
## 12 0.48 0.43 0.73
## 13 4.34 4.75 7.16
## 14 0.50 0.76 0.82
## 15 0.86 1.94 1.48
## Kathmandu..Nepal Hanoi..Vietnam Ho.Chi.Minh.City..Vietnam
## 1 1.99 1.94 1.94
## 2 11.92 15.52 17.50
## 3 5.56 3.88 3.87
## 4 2.38 0.78 0.78
## 5 3.18 1.55 1.36
## 6 0.45 0.45 0.43
## 7 0.20 0.28 0.26
## 8 0.67 1.32 1.21
## 9 0.43 0.71 0.66
## 10 1.29 1.31 1.26
## 11 7.48 14.41 11.35
## 12 0.28 0.59 0.47
## 13 6.75 7.76 9.72
## 14 1.78 0.65 0.73
## 15 2.50 1.18 1.13
## Mexico.City..Mexico
## 1 4.77
## 2 23.86
## 3 4.77
## 4 1.91
## 5 2.86
## 6 0.61
## 7 0.47
## 8 0.91
## 9 1.44
## 10 1.36
## 11 4.93
## 12 0.66
## 13 7.16
## 14 1.08
## 15 2.15
The data set consistes of 55 items on which comparison is drawn. I have therefore created a function that first makes the rows as columns and vice versa resulting the various cities to become rows and various items as columns. The function then groups each of the 55 items in 7 major categories i.e food_beverages cost, transportation cost, housing_utility cost, entertainment_fitness cost, clothing_footwear cost, the gross income and education cost. The 8th column gives the sum of all these categories and gives the total cost of living as a whole.Columns that were added to each of the categories are visible in the function definition below.
For a user to successfully run the function he/she should pass the correct data set else the function would return an error as seen in the function definition below.
data_transformation <- function(data) {
if((all.equal(data,cost)) == "FALSE"){
cat("Error, wrong data set passed, transformation cannot be made")
}
else{
df2 <- data.frame(t(data[-1]))
colnames(df2) <- data[, 1]
food_beverages <- df2[,"Meal, Inexpensive Restaurant"] + df2[,"Meal for 2 People, Mid-range Restaurant, Three-course"] + df2[,"McMeal at McDonalds (or Equivalent Combo Meal)"]
+ df2[,"Domestic Beer (0.5 liter draught)"] + df2["Imported Beer (0.33 liter bottle)"] + df2[,"Coke/Pepsi (0.33 liter bottle)"] + df2[,"Water (0.33 liter bottle) "] +
df2[,"Milk (regular), (1 liter)"] + df2[,"Loaf of Fresh White Bread (500g)"] + df2[,"Eggs (regular) (12)"] + df2[,"Local Cheese (1kg)"] + df2[,"Water (1.5 liter bottle)"] +
df2[,"Bottle of Wine (Mid-Range)"] + df2[,"Domestic Beer (0.5 liter bottle)"] + df2[,"Imported Beer (0.33 liter bottle)"] + df2[,"Cigarettes 20 Pack (Marlboro)"] +
df2[,"Chicken Breasts (Boneless, Skinless), (1kg)"] + df2[,"Apples (1kg)"] + df2[,"Oranges (1kg)"] + df2[,"Potato (1kg)"] + df2[,"Lettuce (1 head)"] + df2[,"Cappuccino (regular)"]
+ df2[,"Rice (white), (1kg)"] + df2[,"Tomato (1kg)"] + df2[,"Banana (1kg)"] + df2[,"Onion (1kg)"] +df2[,"Beef Round (1kg) (or Equivalent Back Leg Red Meat)"]
transportation <- df2[,"One-way Ticket (Local Transport)"] + df2[,"Monthly Pass (Regular Price)"] + df2[,"Gasoline (1 liter)"] + df2[,"Volkswagen Golf"] + df2[,"Taxi 1km (Normal Tariff)"]
+ df2[,"Taxi Start (Normal Tariff)"] + df2[,"Taxi 1hour Waiting (Normal Tariff)"] + df2[,"Toyota Corolla 1.6l 97kW Comfort (Or Equivalent New Car)"]
living_utility <- df2[,"Apartment (1 bedroom) in City Centre"] + df2[,"Apartment (1 bedroom) Outside of Centre"] + df2[,"Apartment (3 bedrooms) in City Centre"] +
df2[,"Apartment (3 bedrooms) Outside of Centre"] + df2[,"Basic (Electricity, Heating, Cooling, Water, Garbage) for 85m2 Apartment"] +
df2[,"1 min. of Prepaid Mobile Tariff Local (No Discounts or Plans)"] + df2[,"Internet (60 Mbps or More, Unlimited Data, Cable/ADSL)"] + df2[,"Price per Square Meter to Buy Apartment in City Centre"]
+ df2[,"Price per Square Meter to Buy Apartment Outside of Centre"] + + df2[,"Mortgage Interest Rate in Percentages (%), Yearly, for 20 Years Fixed-Rate"]
entertainment_fitness <- df2[,"Fitness Club, Monthly Fee for 1 Adult"] + df2[,"Tennis Court Rent (1 Hour on Weekend)"] + df2[,"Cinema, International Release, 1 Seat"]
clothing_footwear <- df2[,"1 Pair of Jeans (Levis 501 Or Similar)"] + df2[,"1 Summer Dress in a Chain Store (Zara, H&M, ...)"] + df2[,"1 Pair of Nike Running Shoes (Mid-Range)"]
+ df2[,"1 Pair of Men Leather Business Shoes"]
income <- df2[,"Average Monthly Net Salary (After Tax)"]
education <- df2[,"Preschool (or Kindergarten), Full Day, Private, Monthly for 1 Child"] + df2[,"International Primary School, Yearly for 1 Child"]
cost_of_living <- food_beverages + transportation + living_utility + entertainment_fitness + clothing_footwear + income + education
countries <- rownames(df2)
df3<- data.frame("food_beverages" = food_beverages , "transportation" = transportation, "housing_utility" = living_utility , "entertainment_fitness" = entertainment_fitness, "clothing_footwear" = clothing_footwear, "income" = income, "education" = education, "cost_of_living" = cost_of_living)
data<- data.frame(df3, row.names = countries)
print("After transformation the data set looks like")
return(data)
}
}
The first 5 rows after applying this tranformation function looks like below:
data = data_transformation(cost)
## [1] "After transformation the data set looks like"
print(head(data))
## food_beverages transportation housing_utility
## Saint.Petersburg..Russia 41.09 19333.47 5081.65
## Istanbul..Turkey 23.68 24476.40 2718.57
## Izmir..Turkey 18.34 24468.14 1840.27
## Helsinki..Finland 85.00 22065.04 12385.12
## Chisinau..Moldova 29.56 16345.71 2208.89
## Milan..Italy 83.00 22443.71 13986.53
## entertainment_fitness clothing_footwear income
## Saint.Petersburg..Russia 49.74 184.99 645.94
## Istanbul..Turkey 44.58 123.37 436.00
## Izmir..Turkey 35.37 108.42 392.07
## Helsinki..Finland 73.26 197.31 2382.56
## Chisinau..Moldova 43.93 178.16 286.95
## Milan..Italy 99.58 205.77 1601.00
## education cost_of_living
## Saint.Petersburg..Russia 5800.69 31137.57
## Istanbul..Turkey 7188.37 35010.97
## Izmir..Turkey 5160.59 32023.20
## Helsinki..Finland 1992.60 39180.89
## Chisinau..Moldova 2889.82 21983.02
## Milan..Italy 11379.17 49798.76
After sorting the data according to the cost_of_living the top 5 most expensive cities that can be concluded from the data set are as follows, also the cost_of_living in the top 5 cities can be seen in the plot.
## [1] "Singapore..Singapore" "125903.91"
## [1] "New.York..NY..United.States" "88214.21"
## [1] "Hong.Kong..Hong.Kong" "87735.27"
## [1] "San.Francisco..CA..United.States" "81904.66"
## [1] "Zurich..Switzerland" "79512"
## The minimum cost of living among all cities is: 14183.98
## The maximum cost of living among all cities is 125903.9
## The range of cost of living among all cities is 111719.9
Correlation analysis is a statistical method used to evaluate the strength of relationship between two quantitative variables. A high correlation means that two or more variables have a strong relationship with each other, while a weak correlation means that the variables are hardly related. “ggpubr” library can be used to find correlation between the various variables.
## [1] "Using the library ggpubr, correlation between the variables is:"
## food_beverages transportation housing_utility
## food_beverages 1.0000000 0.2316255 0.6822695
## transportation 0.2316255 1.0000000 0.3815669
## housing_utility 0.6822695 0.3815669 1.0000000
## entertainment_fitness 0.7309782 0.3481898 0.7040342
## clothing_footwear 0.6353724 0.3744455 0.4173999
## income 0.8297093 0.1814270 0.7565999
## education 0.6294370 0.2422842 0.7958883
## cost_of_living 0.6581109 0.6753583 0.8964769
## entertainment_fitness clothing_footwear income
## food_beverages 0.7309782 0.6353724 0.8297093
## transportation 0.3481898 0.3744455 0.1814270
## housing_utility 0.7040342 0.4173999 0.7565999
## entertainment_fitness 1.0000000 0.5779411 0.6465545
## clothing_footwear 0.5779411 1.0000000 0.3259875
## income 0.6465545 0.3259875 1.0000000
## education 0.6267409 0.2523154 0.7972912
## cost_of_living 0.7007316 0.4388106 0.7402286
## education cost_of_living
## food_beverages 0.6294370 0.6581109
## transportation 0.2422842 0.6753583
## housing_utility 0.7958883 0.8964769
## entertainment_fitness 0.6267409 0.7007316
## clothing_footwear 0.2523154 0.4388106
## income 0.7972912 0.7402286
## education 1.0000000 0.8422262
## cost_of_living 0.8422262 1.0000000
From the result above we can say the cost_of_living is greatly correlated to housing_utility cost,education cost and gross income
##
## Call:
## lm(formula = cost_of_living ~ food_beverages + transportation +
## housing_utility + entertainment_fitness + clothing_footwear +
## income + education, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.273e-11 -4.039e-12 -1.356e-12 1.986e-12 2.750e-10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.402e-12 1.079e-11 -5.010e-01 0.617
## food_beverages 1.000e+00 1.961e-13 5.100e+12 <2e-16 ***
## transportation 1.000e+00 3.091e-16 3.235e+15 <2e-16 ***
## housing_utility 1.000e+00 6.112e-16 1.636e+15 <2e-16 ***
## entertainment_fitness 1.000e+00 1.471e-13 6.799e+12 <2e-16 ***
## clothing_footwear 1.000e+00 9.004e-14 1.111e+13 <2e-16 ***
## income 1.000e+00 3.909e-15 2.558e+14 <2e-16 ***
## education 1.000e+00 5.888e-16 1.698e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.386e-11 on 152 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.09e+31 on 7 and 152 DF, p-value: < 2.2e-16
## Checking accuracy of model by giving ground truth data
## Predicting cost of living when the values of 1st row are given i.e for the city Saint.Petersburg..Russia is :
## 1
## 31137.57
The prediction model gave really accurate results on ground data as seen above for Saint.Petersburg..Russia. The normal QQ_Plot is consistent with linearity therefore we can conclude the experimental errors of the data are normally distributed. Also the plot between fitted values and residuals seems to be equally distributed and therefore there seems to be Homogeneity of variances. Thus the above plots show that the ANOVA assumptions seem to be satisfied.
The full prediction model consists of taking into consideration all the predictor variables i.e food_beverages cost,transportation cost, housing_utility cost, entertainment_fitness cost, clothing_footwear cost, gross income and education cost. Whereas the reduced model consists of taking into consideration only the predictor variables that are the most greatly correlated to the cost of living i.e housing_utility cost, education cost and gross income
Null Hypothesis : Reduced Model is appropriate
Alternative Hypothesis : Reduced Model is not appropriate
## Analysis of Variance Table
##
## Model 1: cost_of_living ~ housing_utility + income + education
## Model 2: cost_of_living ~ food_beverages + transportation + housing_utility +
## entertainment_fitness + clothing_footwear + income + education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 156 6557351600
## 2 152 0 4 6557351600 2.8799e+30 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## The AIC value of Full model is: 8 -7819.046
## The AIC value of Reduced model is: 4 2812.589
## The output of stepwise function is:
## Start: AIC=-7819.05
## cost_of_living ~ food_beverages + transportation + housing_utility +
## entertainment_fitness + clothing_footwear + income + education
##
## Df Sum of Sq RSS AIC
## <none> 0 -7819.0
## - food_beverages 1 14803 14803 738.4
## - entertainment_fitness 1 26312 26312 830.4
## - clothing_footwear 1 70213 70213 987.5
## - income 1 37250886 37250886 1991.3
## - housing_utility 1 1523849207 1523849207 2585.1
## - education 1 1642022798 1642022798 2597.0
## - transportation 1 5956198007 5956198007 2803.2
The p-value of ANOVA is 2.2e-16, which implies we reject the null hypothesis. Conclusion : Reduced model is not appropriate.
Akaike information criterion (AIC) (Akaike, 1974) is a fined technique based on in-sample fit to estimate the likelihood of a model to predict/estimate the future values. A good model is the one that has minimum AIC among all the other models. The full models has a lower AIC value as compared to reduced model. The stepwise method also suggests the full model as better model. Therefore ANOVA, AIC values and stepwise method all suggest full model is better than the reduced model.
## food_beverages transportation housing_utility
## 6.668225 1.326356 3.888114
## entertainment_fitness clothing_footwear income
## 2.853446 2.577585 6.747972
## education
## 3.859116
Multicollinearity is a term used to describe when two or more predictors in the regression model are highly correlated. VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated. More variation is bad news; If the variance of the coefficients increases, the model isn’t going to be as reliable. Assuming threshold is 10, there seems to be no issue of multicollinearity in the model as all predictor variables have vif value less than 10.
Through the transformation function the rows and columns were interchanged and the 55 items were grouped into 7 major categories for effective and simpler analysis. The cost of living for each city was calculated as the sum of all the 7 cost variables i.e food_beverages cost, transportation cost, housing_utility cost, entertainment_fitness cost, clothing_footwear cost, education cost and gross income. The data set was then sorted in decreasing order of cost of living and I thus obtained the top 5 most expensive cities to stay in around the world. The shinny app can be used to find a particular row in the transformed data set and also the cost of living of a particular city from the top by inputting an integer in each of text box respectively..
Further I used the “ggpubr” library to calculate correlation between the 7 variables to find the variables that were most strongly correlated to the cost of living and found that housing_utility cost,education cost and gross income were the most highly correlated and thus I could imply these 3 factors can greatly impact the cost of living in a city. I developed a prediction model to predict the cost of living in a particular city when the values of each of the predictor variables were given. Evaluated the performance of the model on the given data and got really accurate results. Also, compared the full and reduced model(considering only variables having high correlation to the cost of living) found that the full model was better than the reduced model. This decision was based on the results of ANOVA, AIC value and stepwise method. Also there seemed to be no issue of multicollinearity based on the vif value.
This data set considers only values for curent year. In future, data for mutiple years can be collected and then similar methods can be used to get more accurate and better results.
I am a Computer Science graduate student specializing in the field of Data Science and Machine Learning. I like to call myself a “Data Enthusiast” as I love to work around data, solve statistical questions and build intelligent models